5420 ANOMALY DETECTION
A8: Modeling Strategies Assignment & Supervised ML
Zeying Liu (zl3121)
import numpy as np
import pandas as pd
df = pd.read_csv('XYZloan_default_selected_vars.csv')
df.tail()
# Check the distribution of y
df['loan_default'].value_counts()
# Check all variable names
df.columns
# Drop the leftover index columns
df = df.drop('Unnamed: 0', axis='columns')
df = df.drop('Unnamed: 0.1', axis='columns')
# Split data (60% in training, 40% in testing)
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.40, random_state=42)
train.shape
# Check variable types and source prefixes
var = pd.DataFrame(train.dtypes).reset_index()
var.columns = ['varname','dtype']
var['source'] = var['varname'].str[:2]
var['source'].value_counts()
# Categorize variables by source prefix, and remove a bad data field: AP004
MB_list = list(var[var['source']=='MB']['varname'])
AP_list = list(var[(var['source']=='AP') & (var['varname']!='AP004')]['varname'])
TD_list = list(var[var['source']=='TD']['varname'])
CR_list = list(var[var['source']=='CR']['varname'])
PA_list = list(var[var['source']=='PA']['varname'])
CD_list = list(var[var['source']=='CD']['varname'])
# Check values in AP list
AP_list
# Check the distribution of y in train
train['loan_default'].value_counts()
What is a random forest?
A random forest is a bagging ensemble: it trains many decision trees, each on a random bootstrap sample of the data (with a random subset of features considered at each split), and averages the trees' predictions. The averaging reduces variance, which mitigates overfitting.
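The bagging mechanism can be sketched in a few lines of NumPy. Here each "tree" is reduced to a one-split decision stump for brevity, fit on its own bootstrap resample; the forest then averages the stumps' votes. This is a toy illustration of the idea, not the H2O implementation used below.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: the label is 1 exactly when x exceeds 0.5
X = rng.uniform(0, 1, 500)
y = (X > 0.5).astype(int)

def fit_stump(xs, ys):
    """Pick the split threshold that best classifies one bootstrap sample."""
    candidates = np.linspace(0.05, 0.95, 19)
    accs = [((xs > t).astype(int) == ys).mean() for t in candidates]
    return candidates[int(np.argmax(accs))]

# Bagging: fit each stump on a bootstrap resample (sampling with
# replacement), then average the individual predictions.
thresholds = []
for _ in range(50):
    idx = rng.integers(0, len(X), len(X))  # bootstrap resample
    thresholds.append(fit_stump(X[idx], y[idx]))

x_new = np.array([0.2, 0.8])
votes = np.mean([(x_new > t).astype(int) for t in thresholds], axis=0)
pred = (votes > 0.5).astype(int)  # majority vote of the ensemble
```

On this noiseless toy problem every stump recovers the true threshold, so the ensemble classifies 0.2 as good (0) and 0.8 as bad (1); on noisy data the averaging is what stabilizes the prediction.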
# Install package
!pip install h2o
import h2o
h2o.init()
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.random_forest import H2ORandomForestEstimator
# Set important parameters
target='loan_default'
predictors = CR_list + TD_list + AP_list + MB_list + CD_list + PA_list # fixed: CD_list was missing and CR_list was listed twice
train_smpl = train.sample(frac=0.1, random_state=1)
test_smpl = test.sample(frac=0.1, random_state=1)
train_hex = h2o.H2OFrame(train_smpl)
test_hex = h2o.H2OFrame(test_smpl)
# Apply random forest model
rf_v1 = H2ORandomForestEstimator(
    model_id='rf_v1',
    ntrees=350,    # modify 1: higher AUC
    nfolds=10,     # modify 2: no difference for this dataset
    min_rows=100,
    seed=1234)
# Train on the 10% sample of the training dataset
rf_v1.train(predictors,target,training_frame=train_hex)
# Visualize the importance of the variables
def VarImp(model_name):
    import matplotlib.pyplot as plt
    # plot the variable importance
    plt.rcdefaults()
    variables = model_name._model_json['output']['variable_importances']['variable']
    y_pos = np.arange(len(variables))
    fig, ax = plt.subplots(figsize=(6, len(variables)/2))
    scaled_importance = model_name._model_json['output']['variable_importances']['scaled_importance']
    ax.barh(y_pos, scaled_importance, align='center', color='green')
    ax.set_yticks(y_pos)
    ax.set_yticklabels(variables)
    ax.invert_yaxis()
    ax.set_xlabel('Scaled Importance')
    ax.set_title('Variable Importance')
    plt.show()
VarImp(rf_v1)
Based on the visualization, the top three variables by influence on the results are TD013, MB007, and TD009, and their weights are much higher than those of the other variables; TD009 carries roughly twice the weight of the fourth-ranked variable, TD005.
# Prediction
predictions = rf_v1.predict(test_hex)
predictions.head()
test_scores = test_hex['loan_default'].cbind(predictions).as_data_frame()
test_scores.head()
def ROC_AUC(my_result, df, target):
    from sklearn.metrics import roc_curve, auc
    from sklearn.metrics import average_precision_score
    from sklearn.metrics import precision_recall_curve
    import matplotlib.pyplot as plt
    # ROC
    y_actual = df[target].as_data_frame()
    y_pred = my_result.predict(df).as_data_frame()
    fpr, tpr, _ = roc_curve(y_actual, y_pred)
    roc_auc = auc(fpr, tpr)
    # Precision-Recall
    average_precision = average_precision_score(y_actual, y_pred)
    print('')
    print(' * ROC curve: The ROC curve plots the true positive rate vs. the false positive rate')
    print('')
    print(' * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy')
    print('')
    print(' * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)')
    print('')
    # plotting
    plt.figure(figsize=(10, 4))
    # ROC
    plt.subplot(1, 2, 1)
    plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area=%0.2f)' % roc_auc)
    plt.plot([0, 1], [0, 1], color='navy', lw=3, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic: AUC={0:0.4f}'.format(roc_auc))
    plt.legend(loc='lower right')
    # Precision-Recall
    plt.subplot(1, 2, 2)
    precision, recall, _ = precision_recall_curve(y_actual, y_pred)
    plt.step(recall, precision, color='b', alpha=0.2, where='post')
    plt.fill_between(recall, precision, step='post', alpha=0.2, color='b')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.ylim([0.0, 1.05])
    plt.xlim([0.0, 1.0])
    plt.title('Precision-Recall curve: PR={0:0.4f}'.format(average_precision))
    plt.show()
ROC_AUC(rf_v1,test_hex,'loan_default')
train_h20 = h2o.H2OFrame(train)
test_h20 = h2o.H2OFrame(test)
rf_v2 = H2ORandomForestEstimator(
    model_id='rf_v2',
    ntrees=350,
    nfolds=10,
    min_rows=100,
    seed=1234)
rf_v2.train(predictors,target,training_frame=train_h20)
VarImp(rf_v2)
Based on the visualization, the top three variables by influence are TD013, AP003, and MB007, but their weights are not much higher than those of the other variables. The result on the whole dataset is quite different from the result on train_hex.
# Predict
predictions = rf_v2.predict(test_h20)
test_scores = test_h20['loan_default'].cbind(predictions).as_data_frame()
test_scores.head()
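`createGains` is called below but its definition is not included in this notebook. A minimal pandas sketch of a decile gains table with the same quantities (cum_count, cum_actual, gain, lift, K-S) might look like the following; the function and column names here are my assumptions, not the notebook's actual helper.

```python
import numpy as np
import pandas as pd

def gains_table(y_actual, y_score, n_bins=10):
    # Rank customers by predicted score, highest risk first, then
    # split them into equally sized deciles.
    d = pd.DataFrame({'actual': y_actual, 'score': y_score})
    d = d.sort_values('score', ascending=False).reset_index(drop=True)
    d['decile'] = pd.qcut(d.index, n_bins, labels=False)
    g = d.groupby('decile').agg(count=('actual', 'size'),
                                actual=('actual', 'sum'))
    total, total_bad = g['count'].sum(), g['actual'].sum()
    g['cum_count'] = g['count'].cumsum()
    g['cum_actual'] = g['actual'].cumsum()
    # gain: cumulative share of all bad applicants captured so far
    g['gain'] = g['cum_actual'] / total_bad
    # lift: captured bads vs. what a random selection would capture
    g['if_random'] = total_bad * g['cum_count'] / total
    g['lift'] = g['cum_actual'] / g['if_random']
    # K-S: gap between cumulative % bad and cumulative % good
    cum_good_pct = (g['count'] - g['actual']).cumsum() / (total - total_bad)
    g['ks'] = (g['gain'] - cum_good_pct).abs()
    return g
```

For instance, on a 20%-bad sample where the model ranks every bad applicant into the top two deciles, decile 0 shows gain 0.5 and lift 5.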
createGains(rf_v2)
The purpose of a gains table is to help the analyst select the better-performing model and decide which segments to focus on.
According to the results, combined with mortgage business insight, 33.44% (0.3344 * 320 ≈ 107) of the customers in Decile 0 are bad loan applicants.
Since gain = cum_actual / cum_count, the bank could avoid 74.38% (1 - 25.62%) of bad loan applicants by not lending to applicants in Deciles 0 through 4.
(cum_actual is the cumulative number of people who are actually unable to repay their loans.)
lift = cum_actual / if_random, which shows how many more bad applicants the bank avoids compared with granting loans to customers at random.
Based on the lift-versus-Decile plot, the slope is steepest between Decile 1 and Decile 2. From the lift perspective, therefore, the bank should avoid applicants in Deciles 0 through 2.
K-S = |cumulative % good - cumulative % bad|, which measures the degree of separation between the score distributions of the good and bad applicants.
First of all, the K-S values across the deciles are not all 0, so the model does separate bad applicants from the rest of the customers. The figure above also shows that the K-S value peaks at 0.22 around Deciles 4 and 5. By the K-S criterion, then, the bank should avoid applicants in Deciles 0 through 4 or 5.
ROC_AUC(rf_v2,test_h20,'loan_default')
Interpretation:
The receiver operating characteristic (ROC) curve visualizes prediction accuracy across the whole range of cutoff values. The AUC here is 0.66, which indicates poor model performance.
The precision-recall (PR) curve, on the other hand, is more informative than ROC on highly skewed datasets. Since the average precision is only 0.3071, the dataset is skewed and needs to be balanced.
As analyzed above, the dataset is imbalanced: records with y = 1 are far rarer than records with y = 0. ROC is a poor lens for such data because its false positive rate is computed over the large majority class, so even a large number of false positives barely moves the curve, while any change in how the small minority class is sampled can swing the picture substantially. This is why the imbalanced dataset needs further sampling.
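This sensitivity can be checked on a toy example (illustrative numbers, not the loan data): replicating every negative three times preserves the ranking of positives against negatives, so the ROC AUC does not move, while the average precision drops.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

y = np.array([1, 0, 1, 0])
s = np.array([0.9, 0.8, 0.7, 0.6])

# Same scores, but each negative replicated three times
y_imb = np.array([1, 0, 0, 0, 1, 0, 0, 0])
s_imb = np.array([0.9, 0.8, 0.8, 0.8, 0.7, 0.6, 0.6, 0.6])

auc_bal, auc_imb = roc_auc_score(y, s), roc_auc_score(y_imb, s_imb)
ap_bal, ap_imb = average_precision_score(y, s), average_precision_score(y_imb, s_imb)
print(auc_bal, auc_imb)  # identical AUC
print(ap_bal, ap_imb)    # average precision drops on the imbalanced version
```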
The H2O package can balance the dataset automatically via its balance_classes option, handling both the under-sampling and the over-sampling sides.
rf_v3 = H2ORandomForestEstimator(
    model_id='rf_v3',
    ntrees=350,
    nfolds=10,
    min_rows=100,
    balance_classes=True,
    seed=1234)
rf_v3.train(predictors,target,training_frame=train_h20)
createGains(rf_v3)
ROC_AUC(rf_v3,test_h20,'loan_default')
The values in the gains table for rf_v3, as well as its AUC and average precision, are similar to rf_v2's: balancing the dataset through the H2O option changed essentially nothing.
In this part, I double the under-sampled data and test whether this adjustment improves the performance of the random forest model.
from sklearn import datasets
from sklearn.model_selection import train_test_split
import numpy as np
from collections import Counter
# Define the feature list used for resampling
features = ['AP001', 'AP002','AP003', 'AP004', 'AP005', 'AP006', 'AP007', 'AP008', 'AP009', 'TD001',
'TD002', 'TD005', 'TD006', 'TD009', 'TD010', 'TD013', 'TD014', 'TD015',
'TD022', 'TD023', 'TD024', 'TD025', 'TD026', 'TD027', 'TD028', 'TD029',
'TD044', 'TD048', 'TD051', 'TD054', 'TD055', 'TD061', 'TD062', 'CR004',
'CR005', 'CR009', 'CR012', 'CR015', 'CR017', 'CR018', 'CR019', 'PA022',
'PA023', 'PA028', 'PA029', 'PA030', 'PA031', 'CD008', 'CD018', 'CD071',
'CD072', 'CD088', 'CD100', 'CD101', 'CD106', 'CD107', 'CD108', 'CD113',
'CD114', 'CD115', 'CD117', 'CD118', 'CD120', 'CD121', 'CD123', 'CD130',
'CD131', 'CD132', 'CD133', 'CD135', 'CD136', 'CD137', 'CD152', 'CD153',
'CD160', 'CD162', 'CD164', 'CD166', 'CD167', 'CD169', 'CD170', 'CD172',
'CD173', 'MB005', 'MB007']
X = df[features]
y = df[target]
y1_cnt = df[target].sum()
y1_cnt
N = 2
y0_cnt = y1_cnt * N
y0_cnt
# Keep all bad applicants (y=1) and draw twice as many good applicants (y=0).
# Sampling df directly keeps every column needed for the H2O frame below.
bad_loans = df[df[target]==1]
good_loans = df[df[target]==0].sample(n=y0_cnt, random_state=0)
smpl = pd.concat([good_loans, bad_loans])
smpl_hex = h2o.H2OFrame(smpl)
rf_v4 = H2ORandomForestEstimator(
    model_id='rf_v4',
    ntrees=350,
    nfolds=10,
    min_rows=100,
    seed=1234)
rf_v4.train(predictors,target,training_frame=smpl_hex)
ROC_AUC(rf_v4,test_hex,'loan_default')
By under-sampling the good applicants down to twice the number of bad applicants, the values of both AUC and average precision increased while keeping the same parameters: from 0.65 to 0.69 and from 0.31 to 0.34, respectively.
ROC_AUC(rf_v4,smpl_hex,'loan_default')
Comparing the results on the training and test sets, the average precision on the training set is significantly higher than on the test set, indicating that the model may still be overfitting.
In this part, I halve the over-sampled data (the good applicants) and test whether this adjustment improves the performance of the random forest model.
y2_cnt = df[target].count() - df[target].sum()
y2_cnt
N = 2
y3_cnt = y2_cnt / N
y3_cnt = int(y3_cnt)
# Same idea as above, but now keep all bad applicants (y=1) and draw only
# half of the good applicants (y=0)
bad_loans = df[df[target]==1]
good_loans = df[df[target]==0].sample(n=y3_cnt, random_state=0)
smp2 = pd.concat([good_loans, bad_loans])
smp2_hex = h2o.H2OFrame(smp2)
rf_v5 = H2ORandomForestEstimator(
    model_id='rf_v5',
    ntrees=350,
    nfolds=10,
    min_rows=100,
    seed=1234)
rf_v5.train(predictors,target,training_frame=smp2_hex)
ROC_AUC(rf_v5,test_hex,'loan_default')
By halving the data in the over-sampled portion, that is, the good applicants, the values of both AUC and average precision increased while keeping the same parameters: from 0.65 to 0.7 and from 0.31 to 0.34, respectively.
Both the ROC result and the PR result are better than with the under-sampling adjustment, so further analysis of the model adjusted for over-sampling (rf_v5) follows.
createGains(rf_v5)
According to the results, combined with mortgage business insight, 41.88% (0.4188 * 320 ≈ 134) of the customers in Decile 0 are bad loan applicants.
Since gain = cum_actual / cum_count, the bank could avoid 73.91% (1 - 26.09%) of bad loan applicants by not lending to applicants in Deciles 0 through 5.
Comparison with the results before the adjustment shows that the adjusted model identifies more bad applicants in the first few deciles.
lift = cum_actual / if_random, which shows how many more bad applicants the bank avoids compared with granting loans to customers at random.
Based on the lift-versus-Decile plot, the slope is steepest between Decile 3 and Decile 4. From the lift perspective, therefore, the bank should avoid applicants in Deciles 0 through 4.
This suggestion is closer to the gains-table suggestion than the previous model's was.
First of all, the K-S values across the deciles are not all 0, so the model effectively separates bad applicants from the rest of the customers. The figure above also shows that the K-S value peaks at 0.29 in Decile 5. By the K-S criterion, then, the bank should avoid applicants in Deciles 0 through 5.
Summary:
Taking all of the above analysis into account, if the bank wants to expand its loan business, it can consider applicants from Decile 5 through Decile 9; if the bank wants to recover most of its loans securely and ensure profitability, it should restrict lending to applicants from Decile 6 through Decile 9.